<!DOCTYPE html>
<html lang="en">
<head>
  <title>Apache Kafka: The Basics</title>
  <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">
  <link rel="stylesheet" type="text/css" href="/main.css">
  <link rel="stylesheet" type="text/css" href="main.css">
</head>

<body>
<header>
  <h1>
    <a href="/">The Secret Lives of Data</a>
  </h1>
  <hr/>
</header>


<article>
<h1>Apache Kafka: The Basics</h1>
by <a href="https://twitter.com/benbjohnson" target="_blank">Ben Johnson</a>
&nbsp;&nbsp;&nbsp;
<!-- TODO: Remove WIP warning -->
<a href="https://github.com/benbjohnson/thesecretlivesofdata/issues/4" target="_blank"><img src="http://img.shields.io/badge/status-work%20in%20progress-red.svg?style=flat-square"/></a>

<p>Many tools in a software developer's toolbox add complexity to a system,
but the Apache Kafka project is one of the few that simplifies it.
Kafka is a distributed log that can be used as a messaging system between
different components of a distributed system.</p>

<p>This is a simple, interactive introduction to the concepts behind this
log-oriented messaging system. We'll cover replication and internals in
future posts.</p>

<h2>Introduction to log processing</h2>

<p>Many developers think of a log as somewhere they write their error messages,
but we're talking about a different kind of log. The log we're discussing
is an append-only log that contains a series of commands, each with a
sequential identifier and some data:</p>

<div id="intro0" class="svg intro" style="height: 160px"></div>

<p>In this example, each entry is a command to either add to or subtract from
the current value, <em>V</em>, in our application. This is a very simple log,
but it illustrates many of the important properties that logs share.</p>
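
<p>As a rough illustration, the log described above might be sketched in Python
as follows. The names <code>Entry</code>, <code>Log</code>, and
<code>apply</code>, the command names <code>"add"</code> and <code>"sub"</code>,
and the amounts are all assumptions made for this sketch; they are not taken
from the visualization or from Kafka itself.</p>

<pre>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    index: int    # sequential identifier assigned by the log
    op: str       # "add" or "sub" (hypothetical command names)
    amount: int   # the data carried by the command


class Log:
    """In-memory append-only log: entries go on the end and are never edited."""

    def __init__(self):
        self._entries = []

    def append(self, op, amount):
        # The next identifier is always one more than the previous one.
        entry = Entry(index=len(self._entries) + 1, op=op, amount=amount)
        self._entries.append(entry)
        return entry.index

    def entries(self):
        # Hand out a tuple so callers cannot modify the log in place.
        return tuple(self._entries)


def apply(value, entry):
    """Apply one command to the current value V and return the new value."""
    if entry.op == "add":
        return value + entry.amount
    if entry.op == "sub":
        return value - entry.amount
    raise ValueError(f"unknown command: {entry.op}")


log = Log()
log.append("add", 5)
log.append("sub", 2)
log.append("add", 10)

v = 0
for entry in log.entries():
    v = apply(v, entry)
print(v)  # 13
```
</pre>

<p>The current value is never stored in the log. It is derived by applying
the entries in order, which is what makes the later sections on replay
possible.</p>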


<h3>Mechanical sympathy</h3>

<p>One important reason to use a log is that it is extremely efficient.
<a href="http://mechanical-sympathy.blogspot.com/">Mechanical sympathy</a> means
writing software that works in harmony with the underlying hardware so
you can use that hardware as efficiently as possible.</p>

<p>Your hard disk and RAM have the highest throughput when you read and write to
them sequentially. Sequential access can be an order of magnitude faster than
reading and writing data at random locations. Logs take advantage of this by
storing entries one after another.</p>

<p>Many systems are optimized using an internal log. Most databases, for
example, use a write-ahead log (WAL). The WAL keeps a record of changes so
they're not lost during a system crash, and those changes are later applied in
bulk to the underlying data store. Without a WAL, a database would have to make
expensive random writes and disk syncs on every operation.</p>
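
<p>To make the idea concrete, here is a minimal sketch of a write-ahead log in
Python. <code>TinyStore</code>, <code>checkpoint</code>, <code>recover</code>,
and the JSON-lines file format are hypothetical choices for this example. Real
databases use far more elaborate formats and batching strategies.</p>

<pre>
```python
import json
import os
import tempfile


class TinyStore:
    """A toy key-value store that records every change in a WAL first."""

    def __init__(self, wal_path):
        self.data = {}      # stands in for the underlying data store
        self.pending = []   # changes that are logged but not yet applied
        self._wal = open(wal_path, "a", encoding="utf-8")

    def put(self, key, value):
        # Sequential append plus fsync: the change is durable even though
        # the data store itself has not been touched yet.
        record = {"key": key, "value": value}
        self._wal.write(json.dumps(record) + "\n")
        self._wal.flush()
        os.fsync(self._wal.fileno())
        self.pending.append(record)

    def checkpoint(self):
        # Apply all pending changes to the data store in one batch.
        for record in self.pending:
            self.data[record["key"]] = record["value"]
        self.pending.clear()

    def close(self):
        self._wal.close()


def recover(wal_path):
    """Rebuild the state after a crash by replaying the WAL from the start."""
    data = {}
    with open(wal_path, encoding="utf-8") as wal:
        for line in wal:
            record = json.loads(line)
            data[record["key"]] = record["value"]
    return data


wal_path = os.path.join(tempfile.mkdtemp(), "store.wal")
store = TinyStore(wal_path)
store.put("a", 1)
store.put("b", 2)
store.put("a", 3)
store.close()  # simulate a crash before checkpoint() ever ran

recovered = recover(wal_path)
print(recovered)  # {'a': 3, 'b': 2}
```
</pre>

<p>In this sketch the data store is still empty when the simulated crash
happens, yet nothing is lost because every change reached the log first.</p>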


<h3>Replaying the log</h3>

<p>Another benefit of a log is that changes can be replayed, so you can see the
state of your system at any given point in time. More important for distributed
systems, though, is that you can copy the state of a system to another machine
and then replay every change made after the copy started to keep the two
systems in sync.</p>
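
<p>The copy-then-replay approach might look roughly like this in Python.
<code>Primary</code>, <code>Replica</code>, <code>snapshot</code>, and
<code>catch_up</code> are hypothetical names for this sketch, and the commands
reuse the add and subtract idea from the introduction with made-up amounts.</p>

<pre>
```python
def apply(value, command):
    """Apply one (op, amount) command to a value."""
    op, amount = command
    return value + amount if op == "add" else value - amount


class Primary:
    def __init__(self):
        self.value = 0
        self.log = []  # list of (op, amount) commands in arrival order

    def execute(self, op, amount):
        self.log.append((op, amount))
        self.value = apply(self.value, (op, amount))

    def snapshot(self):
        # The state plus the number of log entries already included in it.
        return self.value, len(self.log)


class Replica:
    def __init__(self, snapshot):
        self.value, self.applied = snapshot

    def catch_up(self, log):
        # Replay only the entries written after the snapshot was taken.
        for command in log[self.applied:]:
            self.value = apply(self.value, command)
        self.applied = len(log)


primary = Primary()
primary.execute("add", 5)
primary.execute("sub", 2)

snap = primary.snapshot()    # copy the state: value 3 after 2 entries

primary.execute("add", 10)   # changes keep arriving after the copy
primary.execute("sub", 4)

replica = Replica(snap)
replica.catch_up(primary.log)
print(replica.value, primary.value)  # 9 9
```
</pre>

<p>Recording how many entries a snapshot already covers is what lets the
replica skip them and apply each later change exactly once.</p>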

<p><em>VIZ: Copy state and replay log.</em></p>

<p></p>


<h3>Logical clocks</h3>

<p></p>


<h2>Producing messages</h2>
<h2>Consuming messages</h2>
<h2>Segmenting messages</h2>
<h2>Parallelizing topics</h2>

</article>

<script src="/scripts/d3/d3-3.3.9.min.js"></script>
<script src="main.js"></script>
</body>
</html>
